Spam Corpus Creation for TREC

Authors

  • Gordon V. Cormack
  • Thomas R. Lynam
Abstract

TREC’s Spam Filtering Track (Cormack & Lynam, 2005) introduces a standard testing framework that is designed to model a spam filter’s usage as closely as possible, to measure quantities that reflect the filter’s effectiveness for its intended purpose, and to yield repeatable (i.e. controlled and statistically valid) results. The TREC Spam Filter Evaluation Toolkit is free software that, given a corpus and a filter, automatically runs the filter on each message in the corpus, compares the result to the gold standard for the corpus, and reports effectiveness measures with 95% confidence limits. The corpus consists of a chronological sequence of email messages, and a gold standard judgement for each message. We are concerned here with the creation of appropriate corpora for use with the toolkit.
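
The evaluation loop described in the abstract can be illustrated with a minimal Python sketch. This is not the TREC Spam Filter Evaluation Toolkit itself: the function names `evaluate` and `wilson_interval`, the `(text, gold_label)` corpus representation, and the choice of a Wilson score interval for the 95% confidence limits are all assumptions made for illustration. The abstract does not say which measures or interval method the toolkit uses.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def wilson_interval(errors: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence limits for an error rate.

    Assumption: a Wilson score interval; the toolkit's own method may differ.
    """
    if total == 0:
        return (0.0, 1.0)
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def evaluate(
    corpus: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Dict[str, Tuple[float, Tuple[float, float]]]:
    """Run a filter over a chronological corpus and score it against the gold standard.

    `corpus` yields (message_text, gold_label) pairs in chronological order,
    where gold_label is "ham" or "spam" (hypothetical representation).
    Returns, per gold label, the misclassification rate and its confidence limits.
    """
    counts = {"ham": [0, 0], "spam": [0, 0]}  # label -> [errors, total]
    for text, gold in corpus:
        predicted = classify(text)
        counts[gold][1] += 1
        if predicted != gold:
            counts[gold][0] += 1

    report = {}
    for label, (errors, total) in counts.items():
        rate = errors / total if total else 0.0
        report[label] = (rate, wilson_interval(errors, total))
    return report
```

A filter here is any callable that maps a message to `"ham"` or `"spam"`. The sketch reports ham and spam misclassification rates separately because the two kinds of error have different consequences for a user.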

Similar resources

Overview of the TREC 2010 Web Track

The TREC Web Track explores and evaluates Web retrieval technology over large collections of Web data. In its current incarnation, the Web Track has been active for two years. For TREC 2010, the track includes three tasks: 1) an adhoc retrieval task, 2) a diversity task, and 3) a spam task. As we did for TREC 2009, we based our experiments on the billion-page ClueWeb09 data set created by the L...

A TREC Along the Spam Track with SpamBayes

This paper describes the SpamBayes submissions made to the Spam Track of the 2005 Text Retrieval Conference (TREC). SpamBayes is briefly introduced, but the paper focuses more on how the submissions differ from the standard installation. Unlike in the majority of earlier publications evaluating the effectiveness of SpamBayes, the fundamental ‘unsure’ range is discussed, and the method of removi...

A Discriminative Classifier Learning Approach to Image Modeling and Spam Image Identification

We propose a discriminative classifier learning approach to image modeling for spam image identification. We analyze a large number of images extracted from the SpamArchive spam corpora and identify four key spam image properties: color moment, color heterogeneity, conspicuousness, and self-similarity. These properties emerge from a large variety of spam images and are more robust than simply u...

Batch and Online Spam Filter Comparison

In the TREC 2005 Spam Evaluation Track, a number of popular spam filters – all owing their heritage to Graham’s A Plan for Spam – did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique – Prediction by Partial Matching (PPM) – performed excepti...

Document and Query Expansion Models for Blog Distillation

This paper presents the CMU submission to the 2008 TREC blog distillation track. Similar to last year’s experiments, we evaluate different retrieval models and apply a query expansion method that leverages the link structure in Wikipedia. We also explore using a corpus that combines several different representations of the documents, using both the feed XML and permalink HTML, and apply initial...

Publication date: 2005